GAIN MODIFICATION IN A BACKWARD PROPAGATION NEURAL NETWORK


Our  research  into  backward  propagation  has  led  to  a  number  of  new  theoretical  and 
empirical  results.  We  have  developed  a  generalized  version  of  backward  propagation 
which  incorporates  gain  modification.  In  our  generalized  network,  both  gains  and 
synapses  are  modified  by  a  backward  propagation  procedure.  Synapses  are  modified  in 
proportion  to  the  negative  gradient  of  the  energy  with  respect  to  the  synaptic  weight  as  in 
ordinary  backward  propagation,  and  gains  are  modified  in  proportion  to  the  negative  partial 
derivative  with  respect  to  the  gain.  Since  the  resulting  error  signals  for  the  gain  and 
synaptic  weights  are  proportional  to  one  another,  the  computational  complexity  of  our 
generalized  network  is  comparable  to  that  of  the  original  backward  propagation  model. 
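
The update rule described above can be sketched for a single unit. This is a minimal illustration, not the report's implementation: it assumes a sigmoid unit whose output is sigmoid(g * h), with h the weighted input sum, g the gain, and a squared-error energy. The function names and learning rates are hypothetical. Note that the weight and gain gradients share the same error signal `delta`, which is why the extra cost of gain modification is small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, g, x):
    """Output of one unit with synaptic weights w and gain g (assumed form)."""
    h = sum(wi * xi for wi, xi in zip(w, x))   # net input
    return sigmoid(g * h), h

def energy(w, g, x, target):
    y, _ = forward(w, g, x)
    return 0.5 * (y - target) ** 2

def gradients(w, g, x, target):
    """Gradients of the energy with respect to the weights and the gain."""
    y, h = forward(w, g, x)
    delta = (y - target) * y * (1.0 - y)       # shared error signal
    grad_w = [delta * g * xi for xi in x]      # dE/dw_i = delta * g * x_i
    grad_g = delta * h                         # dE/dg   = delta * h
    return grad_w, grad_g

def train_step(w, g, x, target, lr_w=0.5, lr_g=0.5):
    """One gradient step on both the synapses and the gain."""
    grad_w, grad_g = gradients(w, g, x, target)
    w_new = [wi - lr_w * gi for wi, gi in zip(w, grad_w)]
    g_new = g - lr_g * grad_g
    return w_new, g_new
```

In a multi-layer network the same `delta` would be propagated backward as in ordinary backward propagation; only the final multiplication differs between the weight and gain updates.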

Simulations of the new network have been performed on a concentric circle paradigm in
two dimensions. In the concentric circle problem, we present the x and y coordinates of
patterns in the unit circle. Patterns that lie outside a predetermined radius are
in one class, while those interior to the radius belong to a second class. In our technical
report "Gain Modification Enhances High Momentum Backward Propagation" (Bachmann,
1989) we demonstrated that a combination of high momentum and gain modification leads
to a faster convergence rate than high momentum alone. Bare backward
propagation converged at an even slower rate, as expected. The definition of convergence
for this study was that the network response for all patterns falls within 0.1 of the target
output. Additional work carried out since the publication of our report has
shown that the onset of generalization for this paradigm actually occurs on fairly short time
scales, and that on these time scales there is little difference in generalization between
momentum alone and momentum with gain modification. However, both of these
approaches achieve significantly better levels of generalization than bare backward
propagation on short time scales. In essence, we have shown that with momentum, or with a
combination of momentum and gain modification, the network learns to generalize rapidly
compared to ordinary backward propagation. However, in precisely fitting the training data,
the best convergence rate is achieved by a combination of gain modification and momentum.
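
The paradigm and the convergence criterion can be sketched as follows. The radius 0.5, the sampling scheme and the function names are assumptions for illustration; the report only states that the radius is predetermined and that all responses must fall within 0.1 of the target.

```python
import math
import random

def make_patterns(n, radius=0.5, seed=0):
    """Sample n points in the unit circle, labelled by the assumed radius."""
    rng = random.Random(seed)
    patterns = []
    while len(patterns) < n:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        r = math.hypot(x, y)
        if r <= 1.0:                            # keep points inside the unit circle
            target = 1.0 if r > radius else 0.0
            patterns.append(((x, y), target))
    return patterns

def has_converged(outputs, targets, tol=0.1):
    """Convergence as defined in the text: every response within tol of its target."""
    return all(abs(o - t) <= tol for o, t in zip(outputs, targets))
```

A training loop would call `has_converged` on the network responses after each epoch and record the epoch count at which it first returns True.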

STATISTICAL  FORMULATION  OF  FEATURE  EXTRACTION 

Our mathematical analysis of unsupervised learning has led to the statistical formulation of
the parameter estimation problem associated with unsupervised learning in a neural
network. The network is presented as an exploratory projection pursuit (PP) method that
performs feature extraction (or dimensionality reduction) on the training data set. The
formulation, which is similar in nature to PP, is based on a minimization of a cost function
over a set of parameters, yielding an optimal decision rule under some norm.

We  have  presented  a  new  projection  index  (cost  function)  that  favors  directions  possessing 
multi-modality,  where  the  multi-modality  is  measured  in  terms  of  the  separability  property 
of  the  data.  The  synaptic  modification  equations,  which  perform  the  minimization  of  the 
cost  function,  turn  out  to  be  similar  to  the  synaptic  modification  equations  governing 
learning  in  BCM  neurons  (Bienenstock,  Cooper,  and  Munro  1982).  This  has  led  to  a  new 
statistical  viewpoint  on  the  biologically-inspired  BCM  neuron,  making  it  a  plausible 
candidate for statistical feature extraction. The directions (synaptic weights) sought by the
neuron maximize a skewness measure of the distribution of the data projected onto that
direction. Skewness is one measure of deviation from normality, so such a direction
reveals important structure in the high-dimensional data.

A  network  was  presented  based  on  the  multiple  feature  extraction  formulation.  Both  the 
linear  and  non-linear  neurons  were  analyzed. 

Part  of  the  analysis  of  the  synaptic  modification  equations,  which  are  stochastic  in  nature 
due  to  the  random  inputs,  was  to  compare  their  trajectories  with  a  deterministic  differential 
equation.  The  deterministic  equation  corresponds  to  the  average  (expected  value)  of  the 
random  differential  equation,  and  is  much  easier  to  handle  since  it  was  shown  that  it 
represents  the  gradient  of  our  projection  index. 

The connection between the synaptic modification equations and their deterministic version
was analyzed by extending a general result on random differential equations (Geman,
SIAM 1979). This work concerns differential equations which contain strong mixing
random processes. Roughly speaking, mixing measures how strongly the future of a random
process depends on its past. The solution process is shown to be well approximated in a
probabilistic sense by a deterministic trajectory, over an infinite time interval, using the
interplay between the rate of fluctuations of the random process and the rate of the mixing.
Abstract  This paper concerns differential equations which contain strong
mixing random processes. The solution process is shown to be well approximated
by a deterministic trajectory, over an infinite time interval, using
the interplay between the rate of fluctuations of the random process and
the rate of the $\varphi$ mixing. An application of the result is given for analysing
synaptic modifications in Neural Networks.

1.  Introduction 

The mathematical theory of stochastic differential equations is concerned mainly with
the study of Itô equations and the associated Markov process. Mostly, the results on
non-Itô type equations have been concerned with the conditions under which $x_\epsilon(t)$
converges (as $\epsilon \to 0$) to a diffusion process on finite intervals $[0, T/\epsilon]$
(cf. Stratonovich, 1963; Cogburn and Hersh, 1973; Papanicolaou and Kohler, 1974;
Blankenship and Papanicolaou, 1977). Averaging results for random differential equations
are usually discussed in conjunction with the law of large numbers (Kohler and
Papanicolaou, 1976) or with the central limit theorem for
$(x_\epsilon(t) - y_\epsilon(t))/\sqrt{\epsilon}$ on $[0, T]$ (cf. Khasminskii, 1966; White, 1976).
Geman (1979) showed that the solution process of a random differential equation which
contains a strong mixing random process is well approximated by a deterministic trajectory
over a finite time interval and, for more restricted systems, over the infinite time interval.
Analysis analogous to that was carried out on Itô type equations by Vrkoc (1966), and by
Lybrand (1975).

In this paper we shall continue the direction taken by Geman and approximate the solution
process by a deterministic trajectory over an infinite time interval, using the interplay
between the rate of fluctuations of the random process and the rate of the $\varphi$ mixing,
yielding a result for a wide family of nonlinear random differential equations. We will
establish conditions under which the random solution stays close in the $L^2$ sense to the
associated deterministic solution. The result is particularly useful when a converging
deterministic equation is approximated by a random equation that is more computationally
feasible. Section 4 is devoted to such an application, in the theory of synaptic modification
in Neural Networks.

Similar analysis was carried out on the discrete time version of such equations; see Ljung
(1978), Kushner and Clark (1978), Dupuis and Kushner (1987), and the references therein.

2.  Formulation  and  statement  of  the  problem 

In this section we briefly summarize the relevant results from Geman (1977, 1979).

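Let $\phi(t,\omega)$ be a bounded stationary stochastic process with $\mathcal{F}_0^t$ and
$\mathcal{F}_t^\infty$ the $\sigma$-fields generated by $\{\phi(\tau,\omega) : 0 \le \tau \le t\}$
and $\{\phi(\tau,\omega) : t \le \tau < \infty\}$ respectively. Let the signed measure
$\nu_{t,\delta}$ be defined on $(\Omega \times \Omega,\ \mathcal{F}_0^t \times \mathcal{F}_{t+\delta}^\infty)$ by

$$\nu_{t,\delta}(B) = P(\omega : (\omega,\omega) \in B) - P \times P(B), \quad \text{for } B \in \mathcal{F}_0^t \times \mathcal{F}_{t+\delta}^\infty.$$

For any $B \in \mathcal{F}_0^t \times \mathcal{F}_{t+\delta}^\infty$ the set
$\{\omega : (\omega,\omega) \in B\}$ is in $\mathcal{F}$, and since the collection of such $B$
is also a monotone class, $\nu_{t,\delta}$ is well defined. The stochastic process
$\phi(t,\omega)$ is said to have Type II mixing if

$$\varphi(\delta) = \sup_{t \ge 0}\ \sup_{A \in \mathcal{F}_0^t \times \mathcal{F}_{t+\delta}^\infty} |\nu_{t,\delta}(A)| \to 0 \quad \text{as } \delta \to \infty.$$

Remark on $\varphi$ mixing: The results we describe hold for Type I mixing as well, both of
which were introduced by Volkonskii and Rozanov (1959), since for both types of mixing
we have $|\nu|_{t,\delta}(\Omega \times \Omega) \le 2\varphi(\delta)$.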
Let $\epsilon$ be a positive number, and consider the system:

$$\dot{x}_\epsilon(t,\omega) = H(x_\epsilon(t,\omega), \omega, t/\epsilon),$$
$$\dot{y}_\epsilon(t) = G_\epsilon(y_\epsilon(t), t),$$
$$x_\epsilon(0,\omega) = y_\epsilon(0) = x_0 \in R^n. \qquad (2.1)$$

Assume:

1. $H$ is jointly measurable with respect to its three arguments.

2. $G_\epsilon(x,t) = E[H(x,\omega,t/\epsilon)]$, and for all $i$ and $j$,
$\frac{\partial}{\partial x_j} G_i(x,t)$ exists, and is continuous in $(x,t)$.

3. For some $T > 0$:

a. There exists a unique solution, $x(t,\omega)$, on $[0,T]$ for almost all $\omega$; and

b. A solution to

$$\frac{\partial}{\partial t} g(t,s,x) = G(g(t,s,x),t), \quad g(s,s,x) = x,$$

exists on $[0,T] \times [0,T] \times R^n$.

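The following notations will be used:

1. $H_\epsilon(x_\epsilon(t,\omega),\omega,t) \stackrel{\mathrm{def}}{=} H(x_\epsilon(t,\omega),\omega,t/\epsilon)$.

2. $g_s(t,s,x) = (\partial/\partial s)\, g(t,s,x)$.

3. $g_x(t,s,x)$ = the $n \times n$ matrix with $(i,j)$ component $(\partial/\partial x_j)\, g_i(t,s,x)$.

4. For $H(x,\omega,t)$ define the families of $\sigma$-fields $\mathcal{F}_0^t$ and
$\mathcal{F}_t^\infty$ such that, for each $t \ge 0$, $\mathcal{F}_0^t$ contains the
$\sigma$-field generated by

$$\{H(x,\omega,\tau) : 0 \le \tau \le t,\ x \in R^n\},$$

and $\mathcal{F}_t^\infty$ contains the $\sigma$-field generated by

$$\{H(x,\omega,\tau) : t \le \tau < \infty,\ x \in R^n\}.$$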
The relation between the random differential equation and its averaged version for system
(2.1) under conditions (1), (2), and (3) is given by:

Lemma (Geman 1977) For any $C^1$ function $K : R^n \to R^1$ and $t \in [0,T]$:

$$E[K(x(t))] = K(y(t)) + \int_0^t \int_{\Omega \times \Omega} \left( \frac{\partial}{\partial x} K(g(t,s,x(s,\omega))) \right) \cdot H(x(s,\omega),\eta,s)\, d\nu_{s,0}\, ds,$$

provided that

$$\left( \frac{\partial}{\partial x} K(g(t,s,x(s,\omega))) \right) \cdot H(x(s,\omega),\eta,s), \quad \text{and}$$
$$\left( \frac{\partial}{\partial x} K(g(t,s,x(s,\omega))) \right) \cdot H(x(s,\omega),\omega,s)$$

are absolutely integrable on $\Omega \times \Omega \times [0,T]$, with respect to
$dP(\omega)\,dP(\eta)\,ds$.

The proof of the lemma is based on the relationship between the initial conditions in
time and in space for an ODE, namely: If $g(t,s,x)$ is the function satisfying

$$\frac{\partial}{\partial t} g(t,s,x) = G(g(t,s,x),t)$$

then

$$g_s(t,s,x) = -g_x(t,s,x)\, G(x,s),$$

for all $t \in [0,\infty)$, $s \in [0,\infty)$, and $x \in R^n$. This follows from the
observation that $g(t,s,x)$ is constant along trajectories of the form $(s, x(s))$
(cf. Hartman, 1964, chap. 5).
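
The identity between the time and space derivatives of the flow can be checked numerically. The sketch below is only an illustration: the scalar field `G`, the RK4 integrator and the evaluation point are hypothetical choices, not taken from the paper. It compares a finite-difference estimate of $g_s(t,s,x)$ with $-g_x(t,s,x)\,G(x,s)$.

```python
import math

def G(x, t):
    """Hypothetical smooth scalar vector field, chosen only for illustration."""
    return -x ** 3 + math.cos(t)

def flow(t, s, x, steps=2000):
    """g(t, s, x): value at time t of the solution equal to x at time s (RK4)."""
    h = (t - s) / steps
    y, tau = x, s
    for _ in range(steps):
        k1 = G(y, tau)
        k2 = G(y + 0.5 * h * k1, tau + 0.5 * h)
        k3 = G(y + 0.5 * h * k2, tau + 0.5 * h)
        k4 = G(y + h * k3, tau + h)
        y += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        tau += h
    return y

def check_identity(t=1.5, s=0.3, x=0.7, d=1e-4):
    """Return (g_s, -g_x * G(x, s)) using central finite differences."""
    g_s = (flow(t, s + d, x) - flow(t, s - d, x)) / (2.0 * d)
    g_x = (flow(t, s, x + d) - flow(t, s, x - d)) / (2.0 * d)
    return g_s, -g_x * G(x, s)
```

The two returned numbers should agree up to finite-difference and integration error; moving the initial time $s$ along a trajectory while moving $x$ with the flow leaves $g(t,s,x)$ unchanged, which is exactly the observation used in the proof.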

Theorem (Geman, 1977) Finite time averaging. Assume also that:

4. There exist continuous functions $B_1(r,t)$, $B_2(r,t)$, and $B_3(r,t)$, such that for
all $i$, $j$, $k$, $\tau \ge 0$, and $\omega$:

a. $|H_i(x,\omega,t,\tau)| \le B_1(|x|,t)$;

b. $|(\partial/\partial x_j) H_i(x,\omega,t,\tau)| \le B_2(|x|,t)$;

c. $|(\partial^2/\partial x_j \partial x_k) H_i(x,\omega,t,\tau)| \le B_3(|x|,t)$.

5. $\sup_{\epsilon > 0,\ t \in [0,T]} |y_\epsilon(t)| \le B_4$ for some $B_4$ and $T$.

Then

$$\sup_{t \in [0,T]} |x_\epsilon(t) - y_\epsilon(t)| \to 0 \quad \text{as } \epsilon \to 0$$

in probability.
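
A small simulation can illustrate finite time averaging. The setup is an assumption made for this sketch and is not taken from the paper: $H(x,\omega,t) = -x + \xi(t,\omega)$, where $\xi$ takes independent values $\pm 1$ on successive unit intervals (bounded, and independent of its past beyond one time unit), so that $G(x,t) = -x$. Because the equation is linear, each piece is integrated exactly.

```python
import math
import random

def sup_error(eps, T=5.0, x0=1.0, seed=0):
    """Sup over [0, T] of |x_eps(t) - y(t)| for one sample path.

    x_eps solves dx/dt = -x + xi(t/eps); y solves the averaged dy/dt = -y.
    xi is constant (+1 or -1) on pieces of length eps in the slow time t.
    """
    rng = random.Random(seed)
    n = int(round(T / eps))                     # number of pieces on [0, T]
    decay = math.exp(-eps)
    x = y = x0
    worst = 0.0
    for _ in range(n):
        xi = rng.choice((-1.0, 1.0))
        x = x * decay + xi * (1.0 - decay)      # exact solution over one piece
        y = y * decay
        # x - y is monotone on each piece, so checking endpoints gives the sup
        worst = max(worst, abs(x - y))
    return worst
```

Calling `sup_error` with decreasing `eps` (for example 0.5, 0.05, 0.001) gives a quick feel for how the random trajectory tightens around the averaged one; a single sample path only illustrates, and does not prove, convergence in probability.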


3.  Averaging on [0, ∞)

When averaging on an infinite interval we require that $\epsilon$ be a function of $t$ with
$\epsilon \searrow 0$, meaning that the mixing rate becomes stronger in time. More
specifically, let $\epsilon$ be a function of the form $\epsilon(t) = \epsilon_0 \bar\epsilon(t)$
where $\bar\epsilon$ is monotonically decreasing to zero in time.

The above lemma still holds when $x$, $H$, $g$ and $G$ are replaced by $x_\epsilon$,
$H_\epsilon$, $g_\epsilon$ and $G_\epsilon$ respectively, and also when $\epsilon$ becomes a
function of $t$.

In order for the approximation to hold on $[0,\infty)$ we require that $B_1$, $B_2$, $B_3$
are constants in condition 4 (this will be relaxed later), extend condition 5 to hold for
$t \in [0,\infty)$, and add the following relation between the rate of the mixing of $H$ and
the convergence of $\epsilon$ to zero:

6. $\exists\, \gamma > 0$, $c > 0$, such that $\varphi(\delta) \le \delta^{-\gamma}$, and
$\epsilon(t) \le \epsilon_0\, t^{-(\frac{1}{2}+1+c)}$, for a monotone decreasing $\epsilon$.

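Theorem 3.1 Assume $H_\epsilon$ is of Type II $\varphi$ mixing, and satisfies conditions 1-6. Then

$$\lim_{\epsilon_0 \to 0}\ \sup_{t \ge 0} E\,|x_\epsilon(t) - y_\epsilon(t)|^2 = 0.$$

Proof: Assume first that $t$ is an integer. Fix $\epsilon_0$ and apply the lemma to the
system using $K(x) = |x - y_\epsilon(t)|^2$:

$$E\,|x_\epsilon(t) - y_\epsilon(t)|^2 = \left| \int_0^t \int_{\Omega \times \Omega} \left( \frac{\partial}{\partial x} K(g_\epsilon(t,s,x_\epsilon(s,\omega))) \right) \cdot H_\epsilon(x_\epsilon(s,\omega),\eta,s)\, d\nu_{s,0}\, ds \right|$$

$$\le \sum_{k=1}^{t} \left| \int_{k-1}^{k} \int_{\Omega \times \Omega} \left( \frac{\partial}{\partial x} K(g_\epsilon(t,s,x_\epsilon(s,\omega))) \right) \cdot H_\epsilon(x_\epsilon(s,\omega),\eta,s)\, d\nu_{s,0}\, ds \right|.$$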
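For any fixed $\delta_k > 0$ (to be chosen later), since each integral is bounded we can
write, $\forall k$:

$$\int_{k-1}^{k} \int_{\Omega \times \Omega} \left( \frac{\partial}{\partial x} K(g_\epsilon(t,s,x_\epsilon(s,\omega))) \right) \cdot H_\epsilon(x_\epsilon(s,\omega),\eta,s)\, d\nu_{s,0}\, ds =$$

$$\mathrm{I}: \quad \int_{k-1}^{k-1+\delta_k} \int_{\Omega \times \Omega} \left( \frac{\partial}{\partial x} K(g_\epsilon(t,s,x_\epsilon(s,\omega))) \right) \cdot H_\epsilon(x_\epsilon(s,\omega),\eta,s)\, d\nu_{s,0}\, ds$$

$$\mathrm{II}: \quad + \int_{k-1+\delta_k}^{k} \int_{\Omega \times \Omega} \left( \frac{\partial}{\partial x} K(g_\epsilon(t,s,x_\epsilon(s-\delta_k,\omega))) \right) \cdot H_\epsilon(x_\epsilon(s-\delta_k,\omega),\eta,s)\, d\nu_{s,0}\, ds$$

$$\mathrm{III}: \quad + \int_{k-1+\delta_k}^{k} \int_{\Omega \times \Omega} \Big\{ \left( \frac{\partial}{\partial x} K(g_\epsilon(t,s,x_\epsilon(s,\omega))) \right) \cdot H_\epsilon(x_\epsilon(s,\omega),\eta,s)$$
$$\qquad - \left( \frac{\partial}{\partial x} K(g_\epsilon(t,s,x_\epsilon(s-\delta_k,\omega))) \right) \cdot H_\epsilon(x_\epsilon(s-\delta_k,\omega),\eta,s) \Big\}\, d\nu_{s,0}\, ds.$$

The bounds on $x_\epsilon$ and its derivatives, and the smoothness of $K$ imply that I is
$O(\delta_k)$. In the second term we can replace $\nu_{s,0}$ by $\nu_{s-\delta,\delta}$ since
these measures agree on $(\Omega \times \Omega,\ \mathcal{F}_0^{s-\delta} \times \mathcal{F}_s^\infty)$,
$s \ge \delta$, and since $x_\epsilon(s-\delta,\omega)$ is $\mathcal{F}_0^{s-\delta}$ measurable.
Since $\nu_{t,\delta}$ is the difference of two probability measures, the total variation
measure satisfies:

$$|\nu|_{t,\delta}(\Omega \times \Omega) \le 2, \quad \text{and} \quad |\nu|_{t,\delta}(\Omega \times \Omega) = 2 \sup_{A} \nu_{t,\delta}(A).$$

Therefore, with Type II (or I) mixing: $|\nu|_{t,\delta}(\Omega \times \Omega) \le 2\varphi(\delta)$.
Applying this to the second integral and using the above bounds again we get that II is
$O(\varphi(\delta_k/\epsilon(k-1)))$. The last term is also $O(\delta_k)$ from the smoothness
of $H_\epsilon$ and of $K$.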

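Now choose $\delta_k = \sqrt{\epsilon_0}\,(k-1)^{-(1+\frac{1}{2}c)}$, $k > 1$; then since
$\epsilon(k-1) \le \epsilon_0 (k-1)^{-(\frac{1}{2}+1+c)}$ we get
$\delta_k/\epsilon(k-1) \ge \frac{1}{\sqrt{\epsilon_0}}(k-1)^{\frac{1}{2}+\frac{1}{2}c}$. From
the condition on $\varphi$ we have
$\varphi(\delta_k/\epsilon(k-1)) \le \epsilon_0^{\gamma/2}(k-1)^{-(\frac{1}{2}+\frac{1}{2}c)\gamma}$.
Since $\gamma > 0$, the sum

$$\sum_{k>1} O(\delta_k) + O(\varphi(\delta_k/\epsilon(k-1))) = O\big(\epsilon_0^{\frac{1}{2}(1 \wedge \gamma)}\big).$$

For the segment of $t$ between two integers, an analogous argument is applied yielding an
extra term of the form $O(\epsilon_0^{1/2} + \epsilon_0^{\gamma/2})$. Therefore
$E\,|x_\epsilon(t) - y_\epsilon(t)|^2 = O\big(\epsilon_0^{\frac{1}{2}(1 \wedge \gamma)}\big)$
uniformly in $t$.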


N. Intrator, March 9, 1990

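This implies that

$$\sup_{t \ge 0} E\,|x_\epsilon(t) - y_\epsilon(t)|^2 = O\big(\epsilon_0^{\frac{1}{2}(1 \wedge \gamma)}\big),$$

and hence

$$\lim_{\epsilon_0 \to 0}\ \sup_{t \ge 0} E\,|x_\epsilon(t) - y_\epsilon(t)|^2 = 0. \qquad \Box$$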

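The following problem is closely related: For fixed $\omega$, let $H(x,\omega,t)$ map
$R^n \times R^m \times R^1$ into $R^n$. Assume that for each $x$, $H(x,\omega,t)$ is a mixing
process, and for each $x$ and $t$ define $G(x,t) = E[H(x,\omega,t)]$. Consider the random
equation

$$\dot{x}_{\tilde\epsilon}(t,\omega) = \tilde\epsilon(t)\, H(x_{\tilde\epsilon}(t,\omega),\omega,t), \quad x_{\tilde\epsilon}(0,\omega) = x_0, \qquad (3.1)$$

with its averaged equation

$$\dot{y}_{\tilde\epsilon}(t) = \tilde\epsilon(t)\, G(y_{\tilde\epsilon}(t),t), \quad y_{\tilde\epsilon}(0) = x_0. \qquad (3.2)$$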

For equation (3.2) condition 6 becomes:

6'. $\exists\, \gamma > 0$, such that

i) $\varphi(\delta) \le \delta^{-\gamma}$,

ii)  Z(t)  =  e0 for  p  =  c  >  and  Vf:  0  <  C]  <  r(t)  <  c;. 

Theorem 3.2 Under the assumptions of theorem (3.1) and (6'):

$$\lim_{\epsilon_0 \to 0}\ \sup_{t \ge 0} E\,|x_{\tilde\epsilon}(t) - y_{\tilde\epsilon}(t)|^2 = 0.$$

Proof:  Apply  the  change  of  variables:  t  =  ^ t2  +  c,  dt  —  ^(2  t  c)rl  “ cdr ,  to  equation  (3.1 ): 

it(r,w)  =  r-^2+c)r(r2+')/f<(xt,a;,TI+c/(o)(2  -c)rl~ 

=  r(r2  +  e)//<(xt,u;,  r/((r)}, 

for  c(r)  =  €qt  — U ^c) ,  Now  observe  that  c  s.atislics  condition  (0)  in  theorem  (3.1),  which 
gives  the  desired  result.  0 
As can be seen from the proof, $p$ has to satisfy the conditions $\frac{1}{2} < p \le 1$, and
$\tilde\epsilon(t)$ has to be greater than $t^{-1}$ so that $r(t) \ge c_0 > 0$, which allows
the invocation of the previous theorem. It follows that if $\tilde\epsilon(t) = t^{-1}$,
convergence is assured for any Type II mixing. Obviously, $p$ may be larger than 1 since
$\tilde\epsilon$ may be split into two functions, one bounded and the other satisfying the
conditions of the theorem. The same argument holds for $r(t)$; however, it is clear that one
would like $\tilde\epsilon$ to go to zero as slowly as possible, since then, if the averaged
version has a limit, the convergence rate of both equations to that limit is inversely
proportional to $p$.

It is possible to extend the theory to the cases where the partial derivatives of $H$ have a
polynomial growth in time. Then $\epsilon$ has to decrease faster so that the above integrals
may still be controlled. We get the following theorem:

Theorem 3.3 Assume that $B_1$, $B_2$, $B_3$, and $B_4$ are bounded by $t^\alpha$ for some
$\alpha > 0$ in condition 4 of theorem 3.1, and replace condition 6 with the following:

6. $\exists\, \gamma > 0$, $c > \frac{1}{2}$, such that $\varphi(\delta) \le \delta^{-\gamma}$,
and $\epsilon(t) \le \epsilon_0\, t^{-(1+c+3\alpha)}$, for a monotone decreasing $\epsilon$. Then

$$\lim_{\epsilon_0 \to 0}\ \sup_{t \ge 0} E\,|x_\epsilon(t) - y_\epsilon(t)|^2 = 0.$$

Proof:  When  applying  the  lemma  as  before  we  get  the  following: 

/  =  ^OW-!):a 

k 

u  =  j2°MWk~  m--a 

k 

k 

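Now choose $\delta_k = \sqrt{\epsilon_0}\,(k-1)^{-(1+\frac{1}{2}(c-\frac{1}{2})+3\alpha)}$;
then since $\epsilon(t) \le \epsilon_0\, t^{-(1+c+3\alpha)}$, we get just as before
$\delta_k/\epsilon(k-1) \ge \frac{1}{\sqrt{\epsilon_0}}(k-1)^{\frac{1}{2}c+\frac{1}{4}}$. The
rest of the proof follows exactly as before. $\Box$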
Extending theorem 3.2 to the case where the partial spatial derivatives are bounded by
a polynomial in $t$ is done by absorbing the growth of $H$ into $\tilde\epsilon$, which gives
the following corollary:

Corollary 3.4 Assume that $B_1$, $B_2$, $B_3$ and $B_4$ are bounded by $t^\alpha$ for some
$\alpha > 0$ in condition 4 of theorem 3.1, and replace condition 6 in theorem 3.2 with the
following:

6'. $\exists\, \gamma > 0$, such that

i) $\varphi(\delta) \le \delta^{-\gamma}$,

ii)  i(t)  =  eortOH®-*  p>,  for  p  =  c  >  i,  and  Vt:  0  <  ci  <  r(t)  <  c?.  Then 

$$\lim_{\epsilon_0 \to 0}\ \sup_{t \ge 0} E\,|x_{\tilde\epsilon}(t) - y_{\tilde\epsilon}(t)|^2 = 0.$$

An important observation has to be made here: If the deterministic version represents
a converging trajectory, e.g., if the equation represents a gradient descent, then as long as
$\tilde\epsilon(t) \ge t^{-1}$, the deterministic version will still converge to a true local
minimum. However, if $\tilde\epsilon(t) < t^{-1}$, then
$\int^\infty \tilde\epsilon(\tau)\, d\tau < \infty$, and so the convergence of the
deterministic equation is not assured, which implies that the convergence of the stochastic
version to a true local minimum is not guaranteed.
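
The role of the integrability of the rate can be seen in a one-dimensional example with closed-form solutions. The example is an assumption of this sketch: gradient descent on $\frac{1}{2}y^2$, i.e. $\dot{y} = -\tilde\epsilon(t)\,y$ with $y(1) = 1$, whose solution is $y(t) = \exp(-\int_1^t \tilde\epsilon(s)\,ds)$. With rate $1/t$ the integral diverges and $y \to 0$, the true minimum; with rate $1/t^2$ the integral stays below 1 and $y$ stalls near $e^{-1}$.

```python
import math

def y_closed_form(t, rate):
    """Solution at time t >= 1 of dy/dt = -eps(t) * y with y(1) = 1."""
    if rate == "1/t":
        integral = math.log(t)          # integral of 1/s from 1 to t (diverges)
    elif rate == "1/t^2":
        integral = 1.0 - 1.0 / t        # integral of 1/s^2 from 1 to t (bounded by 1)
    else:
        raise ValueError("unknown rate: " + rate)
    return math.exp(-integral)
```

Evaluating both rates at a large time, say `y_closed_form(1e6, "1/t")` and `y_closed_form(1e6, "1/t^2")`, shows the first approaching the minimum at 0 while the second remains bounded away from it.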

4.  An  application  to  the  synaptic  modification  equations  of  a  BCM  neuron 

In this section, we apply the theorem to a random differential equation representing the
law governing synaptic weight modification in the BCM theory for learning and memory
in neurons (Bienenstock et al., 1982). We start with a short review of the notation and
definitions of BCM theory; a more thorough review can be found in Intrator (1990), and
the references therein.

Consider a neuron whose input is the vector $x = (x_1, \dots, x_N)$, which has a
synaptic-weight vector $m = (m_1, \dots, m_N)$, both in $R^N$, and activity (in the linear
region) $c = x \cdot m$. The input $x$ is assumed to be a stochastic process of Type II
$\varphi$ mixing, bounded, and piecewise constant. Let $\Theta_m = E[(x \cdot m)^2]$ and
$\phi(c, \Theta_m) = c^2 - \frac{4}{3} c\,\Theta_m$. Here $c$ represents the linear projection
of $x$ onto $m$, and we seek an optimal projection in some sense.

The BCM synaptic modification equations are given by:

$$\dot{m}_\mu = \mu(t)\, \phi(x \cdot m_\mu, \Theta_{m_\mu})\, x, \quad m_\mu(0) = m_0, \qquad (4.1)$$

their averaged version is given by:

$$\dot{\bar{m}}_\mu = \mu(t)\, E\big[\phi(x \cdot \bar{m}_\mu, \Theta_{\bar{m}_\mu})\, x\big], \quad \bar{m}_\mu(0) = m_0. \qquad (4.2)$$

$\mu(t)$ is a global modulator which is assumed to take into account all the global factors
affecting the cell, e.g., the beginning or end of the critical period, or state of arousal (Bear
and Cooper, 1988).

Equation (4.2) is shown to be a dimensionality reduction method based on a cost function
that favors directions $m$ for which the distribution of the inputs is different from normal by
means of skewness (Intrator, 1990).

Our aim is to show the convergence of the stochastic differential equation. This will be
done in two steps: first we show that the averaged deterministic equation converges, and
then we use theorem 3.2 to show the convergence of the random differential equation to its
averaged deterministic equation.

The  convergence  of  the  deterministic  equation 

Without loss of generality, we may assume that the random process $x$ is in the unit ball
in $R^N$, and $\mathrm{Var}(x \cdot m) \ge \lambda \|m\|^2 > 0$, which simply says that $x$
does not lie in a subspace or a manifold of $R^N$. Since we are interested in dimensionality
reduction, we can always reduce a priori the dimensionality of $x$ so that it will span $R^N$
for some $N$. When the theory is applied to a finite value random vector, we can restrict $m$
to be in the span of $x_1, \dots, x_n$.
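When we multiply both sides of the above equation by $\bar{m}_\mu$, assuming none of its
components is zero, we get:

$$\frac{1}{2}\frac{d}{dt}\|\bar{m}_\mu\|^2 = E[(x \cdot \bar{m}_\mu)^3] - \frac{4}{3} E^2[(x \cdot \bar{m}_\mu)^2]$$
$$\le \|\bar{m}_\mu\|^3 - \frac{4}{3}\mathrm{Var}^2(x \cdot \bar{m}_\mu)$$
$$\le \|\bar{m}_\mu\|^3 - \frac{4}{3}\lambda^2 \|\bar{m}_\mu\|^4$$
$$= \|\bar{m}_\mu\|^3 \Big\{ 1 - \frac{4}{3}\lambda^2 \|\bar{m}_\mu\| \Big\},$$

which implies that $\|\bar{m}_\mu\| \le \frac{3}{4\lambda^2}$. $\Box$

Using this fact we can now show the convergence of $\bar{m}_\mu$. We observe that
$\dot{\bar{m}}_\mu = -\nabla R$, where
$R = -\frac{\mu}{3}\{E[(x \cdot \bar{m}_\mu)^3] - E^2[(x \cdot \bar{m}_\mu)^2]\}$ is the risk.
$R$ is bounded from below since $\|\bar{m}_\mu\|$ is bounded, therefore $\bar{m}_\mu$
converges to a local minimum of $R$. $\Box$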
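
The gradient relation between the averaged field and the risk can be checked numerically. The sketch below assumes $\mu = 1$, the form $\phi(c,\Theta) = c^2 - \frac{4}{3}c\,\Theta$ and the risk $R = -\frac{1}{3}\{E[(x \cdot m)^3] - E^2[(x \cdot m)^2]\}$; the coefficients are a reading of a damaged source and should be treated as assumptions. The finite input distribution (`INPUTS`, `PROBS`) is hypothetical.

```python
# Hypothetical finite input distribution inside the unit ball of R^2.
INPUTS = [(0.9, 0.1), (0.2, 0.8), (-0.3, 0.4)]
PROBS = [0.5, 0.3, 0.2]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def expect(f):
    """Exact expectation of f(x) over the finite input distribution."""
    return sum(p * f(x) for p, x in zip(PROBS, INPUTS))

def risk(m):
    """Assumed risk: R(m) = -(1/3) * (E[c^3] - E[c^2]^2), with c = x . m."""
    e3 = expect(lambda x: dot(x, m) ** 3)
    e2 = expect(lambda x: dot(x, m) ** 2)
    return -(e3 - e2 ** 2) / 3.0

def averaged_field(m):
    """Right-hand side of the averaged equation with mu = 1: E[phi(c, theta) x]."""
    theta = expect(lambda x: dot(x, m) ** 2)
    def phi(c):
        return c * c - (4.0 / 3.0) * c * theta   # assumed coefficient 4/3
    return [expect(lambda x, i=i: phi(dot(x, m)) * x[i]) for i in range(len(m))]

def negative_gradient(m, d=1e-6):
    """Central finite-difference estimate of -grad R at m."""
    grad = []
    for i in range(len(m)):
        up = list(m); up[i] += d
        dn = list(m); dn[i] -= d
        grad.append(-(risk(up) - risk(dn)) / (2.0 * d))
    return grad
```

For any test point `m`, the two vectors returned by `averaged_field(m)` and `negative_gradient(m)` should agree up to finite-difference error, which is the sense in which the averaged equation performs gradient descent on the risk.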

The  convergence  of  the  stochastic  equation 

Claim Under the above conditions $m_\mu(t)$ converges in $L^2$ to a local minimum of the risk.

Proof: The calculation above implies that $m_\mu$ is bounded for (almost) every $\omega$.

In our case $B_1$, $B_2$, $B_3$ and $B_4$ are independent of $t$ or $m$; therefore, if we
replace $\tilde\epsilon(t)$ by $\mu(t)$ and apply theorem 3.2, we get

$$\sup_{t \ge 0} E\,|m_\mu(t) - \bar{m}_\mu(t)|^2 \to 0 \quad \text{as } \mu_0 \to 0.$$

$\bar{m}_\mu$, the solution to the deterministic equation, will converge to the same local
minimum $y$, $\forall \mu$ if $\mu_0 < C$, for some positive constant $C$. Therefore we can
choose $T$ for which $|\bar{m}_\mu(t) - y| < \frac{\delta}{2}$, $\mu_0 < C$, $t > T$; then for
$t > T$ we have:

$$|m_\mu(t) - y| \le |m_\mu(t) - \bar{m}_\mu(t)| + |\bar{m}_\mu(t) - y| \le |m_\mu(t) - \bar{m}_\mu(t)| + \frac{\delta}{2}$$

$$\Rightarrow \quad \sup_{t > T} E\,|m_\mu(t) - y| \le \sup_{t > T} E\,|m_\mu(t) - \bar{m}_\mu(t)| + \frac{\delta}{2} \to \frac{\delta}{2} \quad \text{as } \mu_0 \to 0.$$

$\delta$ is arbitrary, which implies that

$$E\,|m_\mu(t) - y| \to 0 \quad \text{as } \mu_0 \to 0. \qquad \Box$$


5.  Summary 

It has been shown that under mild conditions, the equations
$\dot{x}_\epsilon = \epsilon H(x,\omega,t)$ and $\dot{y}_\epsilon = \epsilon G(y,t)$, where
$G(x,t) = E[H(x,\omega,t)]$, have close trajectories on the infinite interval when
$\epsilon(t)$ satisfies the decay condition (6'). The result may be computationally useful
and, as has been shown in the example, may assist in the analysis of the random differential
equation.